takes O(n^{3/2}) units of time. For any sequence that approximates a
geometric progression with an integer common ratio, this bound is the
best possible. (The notion of "sorting template" is used to prove this.)
However, if the sequence consists of the descending sequence of positive
integers less than n and having only 2 and 3 as prime factors,
then Shellsort takes only O(n log^2 n) units of time. Sorting networks
based on Shellsort with this sequence operate approximately 1.5 times
as fast as with previous methods.


This research was supported in part by the National Science Foundation
under grant number GJ 992, and the Office of Naval Research under grant
number N-00014-67-A-0112-0057 NR 044-402. Reproduction in whole or in
part is permitted for any purpose of the United States Government.




ACKNOWLEDGMENTS 


My sincere thanks to Professor Richard Karp and my advisor
Professor Donald Knuth for stimulating my interest in this area and
for very helpful discussions; to Professors Robert Floyd and Harold
Stone for graciously consenting to read the thesis; to the members
of my orals committee for their time and attention in this duty; to
the Departments of Computer Science at the University of California,
Berkeley, and at Stanford University, for essential support in the
form of research assistantships; and to Phyllis Winkler for her
enviably accurate and fast execution of the unenviable task of typing
this thesis.




Table of Contents

Chapter 1. Introduction to Shellsort
Chapter 2. Least Upper Bounds for Most Shellsorts
    2.1 Discussion
    2.2 An Upper Bound for Most Shellsorts
    2.3 Optimality of the O(n^{3/2}) Bound
Chapter 3. An n log^2 n Shellsort
Chapter 4. A Shell Sorting Network
    4.1 Sorting Networks
    4.2 Shellsort with Standard Comparators
    4.3 A Faster Network
Chapter 5. Epilogue
    5.1 Summary and Suggested Problems
    5.2 Conclusions and Perspective
Bibliography


List of Illustrations

Figure 3.1      A triangular array of elements of > 0)
Figure 3.2(a)   A triangle covered by squares
Figure 3.2(b)   A triangular extension
Figure 4.1      Sorting network for four elements
Figure 4.2      An abbreviated comparator
Figure 4.3      An abbreviated sorting network for four elements
Figure 4.4      A sorting network for a 6-element p-chain
Figure 4.5
Figure 4.6
Figure 4.7      A comparator for zeroes and ones
Figure 4.8      Notation for Figure 4.7
Figure 4.9(a)   A sorting network for one p-chain
Figure 4.9(b)   Implementation of Figure 4.9(a)
Figure 4.10(a)  Implementation of Figure 4.9(a) using median-finders
Figure 4.10(b)  Same as (a) with registers R added
Figure 4.11     Implementation of an R box
Figure 4.12     An implementation of a median finder M
Figure 4.13     Equivalent circuits for possible states of (R,R')
Figure 4.14(a)  Structure of a comparator
Figure 4.14(b), (c)  MIN, MAX circuits
Figure 4.15     M with triplicated OR gates


Chapter 1

Introduction to Shellsort

The problem is to sort the elements of the array
A[1], A[2], ..., A[n] into ascending order, given some total ordering
on the possible values of the elements of A. The high cost of
random-access memory together with the speed of in-core sorting
motivates the consideration of algorithms that sort arrays "in their
own length", with little or no auxiliary storage requirements beyond
what is needed to hold the array. A number of such algorithms are
known, and all but Shellsort [Shell, 1959] have proved more or less
amenable to an analysis of bounds on their running time, as a function
of n. Chapter 2 shows that O(n^{3/2}) units of time is the best
possible upper bound on the more conventional variations of Shellsort.

To discuss Shellsort requires some terminology. A p-chain of A
is a sequence of elements of A occurring at intervals of p. For
instance, if n = 8, then A has three 3-chains, {A[1],A[4],A[7]},
then {A[2],A[5],A[8]}, and then {A[3],A[6]}. In general, A has
min(n,p) p-chains, each of length ⌈n/p⌉ or ⌊n/p⌋.

When A's p-chains are in ascending order, A is defined to be
p-ordered. To p-sort A is to sort A's p-chains.
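The terminology above can be made concrete with a minimal Python sketch. The function names `p_chains` and `is_p_ordered` are illustrative assumptions, not names from the thesis, and the list is stored 0-based, so the thesis's A[1] is `a[0]`.

```python
def p_chains(a, p):
    """Return the p-chains of list a; chain j starts at a[j] (0-based)."""
    return [a[j::p] for j in range(min(len(a), p))]


def is_p_ordered(a, p):
    """True when every p-chain of a is in ascending order."""
    return all(a[i] <= a[i + p] for i in range(len(a) - p))


# The labels 1..8 stand in for A[1..8]; the three 3-chains match the
# example in the text.
chains = p_chains(list(range(1, 9)), 3)
```

With n = 8 and p = 3 this yields chains of lengths 3, 3 and 2, that is, ⌈8/3⌉ or ⌊8/3⌋.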

Shellsort works by repeatedly p-sorting A for a characteristic
sequence (abbreviated to "sequence" hereafter) of p's, with the last p
being 1. This last value ensures that A is sorted by this process, since
a 1-ordered array must be ordered. Furthermore, Shellsort prescribes a
particular technique for sorting each p-chain, namely insertion sorting.

Insertion sorting is a technique whereby one starts with an array
of no elements, and some source of n entries, and progressively builds
up a sorted array starting with A[1], A[2], ... by (i) determining for each
entry where in the array so far constructed it should go in order to
keep the array sorted, (ii) moving the appropriate array elements up one
place to make room for it, and (iii) inserting it. Since the space consumed
by the partially constructed array and that consumed by the remaining
uninserted entries is just n items, this method can be used to sort
in place, requiring almost no auxiliary storage, by combining all the
operations for each entry into the one loop, as follows:

procedure insertionsort(A);
    for i := 2 until length(A) do
        for j := i step -1 until 2 while A[j-1] > A[j] do
            swap(A[j-1], A[j]);

The while clause signifies that the iteration is to be terminated if
the expression following the "while" becomes false. The procedure "swap"
exchanges the contents of the locations named by its arguments. The
expression "length(A)" is supposed to be what it says. The variables
i and j are assumed to be declared implicitly, as in ALGOL W, by being
named as the controlled variable of a for loop.

The outer loop of the procedure cycles through the source of entries.
A[1] is not processed since the destination of its contents must be A[1].
The inner loop takes each entry and shuffles it backwards through the
array to its proper place. After each execution of the body of the
outer loop, but before i is incremented, the array A[1:i] is ordered.

Let us define an inversion in an array A to be a pair of elements,
A[i] and A[j], such that i < j but A[i] > A[j]. Thus A is ordered
if and only if there are no inversions in A. Define an adjacent
inversion to be one whose elements are adjacent. Then the insertion
sort above can be seen to eliminate adjacent inversions. No other
inversions appear or disappear because every other pair of elements
maintain their relative positions after the exchange. Thus each exchange
reduces the number of inversions in A by one. Since A can have up
to C(n,2) = n(n-1)/2 inversions (when A is in descending order, i.e.,
A[i-1] > A[i] everywhere in A), this technique may take up to n(n-1)/2
exchanges to sort A, or O(n^2) exchanges.
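The claim that each exchange removes exactly one inversion can be checked with a small Python sketch of the insertion sort described above. The names `insertion_sort` and `count_inversions` are assumptions made for illustration, and indices are 0-based rather than the 1-based subscripts of the ALGOL text.

```python
def insertion_sort(a):
    """Sort list a in place by adjacent exchanges; return the exchange count."""
    exchanges = 0
    for i in range(1, len(a)):
        j = i
        # The "while" clause: stop as soon as a[j-1] and a[j] are in order.
        while j >= 1 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            exchanges += 1
            j -= 1
    return exchanges


def count_inversions(a):
    """Count pairs i < j with a[i] > a[j] by brute force."""
    n = len(a)
    return sum(1 for i in range(n) for j in range(i + 1, n) if a[i] > a[j])
```

For any input list, the value returned by `insertion_sort` should equal `count_inversions` of the original list, and a descending list of n elements should need n(n-1)/2 exchanges.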

The idea underlying Shellsort is that moving elements of A long
distances at each swap in the early stages, then shorter distances later,
may reduce this O(n^2) bound.

An algorithm for Shellsort using the procedure "insertionsort" is
not easy in ALGOL. We might write, in near-ALGOL:

procedure Shellsort(A, P, m);
    for i := 1 until m do
        for j := 1 until P[i] do
            insertionsort(A[* × P[i] + j]);

The expression A[* × p + j] denotes simply the j-th p-chain of A.

The more usual way to write Shellsort carries out the insertion
sort on a time-shared basis, thus:

procedure Shellsort(A, P, m);
    for i := 1 until m do
        for j := P[i]+1 until length(A) do
            for k := j step -P[i] until P[i]+1 while A[k-P[i]] > A[k] do
                swap(A[k-P[i]], A[k]);
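For readers who want to experiment, the time-shared procedure can be transcribed into Python roughly as follows. This is a sketch under stated assumptions: the name `shellsort`, the 0-based indexing, and the returned exchange count are additions for illustration and are not part of the ALGOL text.

```python
def shellsort(a, increments):
    """Sort list a in place using the given increments (the last must be 1).

    Returns the number of exchanges, the cost measure used in this chapter.
    """
    exchanges = 0
    for p in increments:
        for j in range(p, len(a)):
            k = j
            # Shuffle a[j] backwards through its p-chain to its proper place.
            while k >= p and a[k - p] > a[k]:
                a[k - p], a[k] = a[k], a[k - p]
                exchanges += 1
                k -= p
    return exchanges
```

A call such as `shellsort(data, [7, 3, 1])` sorts `data`; with `increments == [1]` the procedure reduces to the plain insertion sort.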
Because Shellsort works by correcting inversions within p-chains,
it is convenient to call such inversions p-inversions.

The time spent by Shellsort is made up of what it would do with an
ordered array, plus an amount of time at most proportional to the number
of exchanges it must do to sort the array. Since the former time is n
times the number of passes, and since the number of passes considered in
the next chapter is always O(log n), we shall measure the time required
by Shellsort in units of the number of exchanges performed. To convert
this figure to seconds, multiply it by the number of seconds required for
an exchange, a decrement of k, a test for k ≥ P[i]+1 and a subsequent
comparison, and add the time required to Shellsort an ordered array of
the same size. Since the dominant term in the expressions derived in
Chapter 2 is O(n^{3/2}), the time for exchanges asymptotically dominates
the O(n log n) time for Shellsorting an ordered array, which is why
the number of exchanges is an adequate measure in that chapter.

Let us now summarize the remainder of the thesis. In Chapter 2, we
show that Shellsort takes time proportional to n^{3/2} in the worst
case. Prior to this, only Papernov and Stasevich's [1965] upper
bound of O(n^{3/2}) for Shellsort with Hibbard's sequence was known. In
Chapter 3, we describe a considerably faster Shellsort that operates
with only O(n log^2 n) units of time, and in Chapter 4 we show that under
quite reasonable conditions this version of Shellsort can be used to build
a sorting network that requires 0.3 log^2 n units of delay, about 1.5 times
as fast as was previously possible [Batcher, 1968].

Further prologue relevant to Chapter 2 may be found in section 2.1 of
that chapter. Chapter 5 presents a more detailed summary and unification
of the results of Chapters 2 to 4, and also suggests problems for further
research.
Chapter 2

Least Upper Bounds for Most Shellsorts


2.1 Discussion

A natural characteristic sequence to follow when Shellsorting is a
geometric progression. If one thinks of Shellsorting as progressively
bringing each element closer to its final position, in jumps of decreasing
size, it is "natural" to arrange that these jumps decrease geometrically;
this is what happens in a binary search, for example. Possibly some such
consideration has motivated the choice of a (usually slightly perturbed)
geometric progression for almost all Shellsorts.

If a sequence of even numbers, followed by 1, is used, Shellsort
may take up to n(n-2)/8 exchanges when 1-sorting. This would happen
if one 2-chain were 1, 2, ..., n/2 and the other were n/2+1, n/2+2, ..., n.
Since this array is 2-sorted, it is 2k-sorted for all k > 0. Thus at
the last pass, the original array is being 1-sorted, that is, it is simply
being insertion-sorted, which is readily seen to take
1 + 2 + 3 + ... + (n/2 - 1) = n(n-2)/8 exchanges, for even n, an O(n^2)
figure.
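The n(n-2)/8 figure can be confirmed numerically with a short Python sketch. It assumes, as one reading of the paragraph above, that the 2-chain holding 1, ..., n/2 occupies the odd positions A[1], A[3], ...; the function names are illustrative only.

```python
def interleaved_worst_case(n):
    """For even n: 1..n/2 at odd positions, n/2+1..n at even positions."""
    half = n // 2
    a = []
    for i in range(1, half + 1):
        a.extend([i, half + i])
    return a


def one_sort_exchanges(a):
    """Exchanges used by a plain insertion sort, i.e. the final 1-sorting pass."""
    a = list(a)  # work on a copy so the caller's array is left untouched
    count = 0
    for i in range(1, len(a)):
        j = i
        while j >= 1 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            count += 1
            j -= 1
    return count
```

For n = 8 the array is 1, 5, 2, 6, 3, 7, 4, 8, which is already 2-ordered, and the final pass performs 1 + 2 + 3 = 6 = 8·6/8 exchanges.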

Shell [1959] originally suggested the sequence
⌊n/2⌋, ⌊n/4⌋, ⌊n/8⌋, ..., 1. If n is a power of 2, this is
readily seen to be a sequence dealt with in the previous paragraph. This
problem was recognized by Lazarus and Frank [1960], who proposed that
the even elements in Shell's sequence be incremented by one. Thus every
element can be expressed as 2k+1, and its successor must be either k
or k+1, depending on whether k is odd or even respectively. Now k | 2k
and k+1 | 2k+2, so (2k+1, k) = (2k+1, k+1) = 1; that is, every consecutive
pair of elements in the sequence is coprime. We shall see shortly that
O(n^{3/2}), not O(n^2), is the (least) upper bound for Lazarus and Frank's
sequence, mainly because of this coprimeness property.

Hibbard [1963] suggested the descending sequence of all numbers of
the form 2^k - 1 < n, integer k ≥ 1. When n is one less than a power
of 2, this sequence coincides with both Shell's sequence and Lazarus
and Frank's sequence. Many other sequences have been suggested [cf.
Knuth 1972], almost all of them having in common that they form "fuzzy"
geometric progressions, with every element relatively prime to at least one of
its nearby predecessors. (It is interesting to note that both of these
guidelines are ignored in the sequence of Chapter 3 for the O(n log^2 n)
Shellsort.)
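The three sequences discussed in this section can be generated with the following Python sketch. The function names are assumptions, and `lazarus_frank_sequence` follows the description given here (halve, then add one to any even value), which may differ in detail from the formulation in the original 1960 paper.

```python
from math import gcd


def shell_sequence(n):
    """floor(n/2), floor(n/4), ..., 1."""
    seq, p = [], n // 2
    while p >= 1:
        seq.append(p)
        p //= 2
    return seq


def lazarus_frank_sequence(n):
    """Shell's halving rule, with each even value incremented by one."""
    seq, p = [], n // 2
    while p >= 1:
        if p % 2 == 0:
            p += 1
        seq.append(p)
        p //= 2  # the successor of 2k+1 is k, or k+1 after the parity fix
    return seq


def hibbard_sequence(n):
    """All 2**k - 1 < n with k >= 1, in descending order."""
    seq, k = [], 1
    while 2 ** k - 1 < n:
        seq.append(2 ** k - 1)
        k += 1
    return seq[::-1]


def consecutive_coprime(seq):
    """True when every consecutive pair of elements has gcd 1."""
    return all(gcd(x, y) == 1 for x, y in zip(seq, seq[1:]))
```

For n = 15, one less than a power of 2, all three functions return 7, 3, 1, in agreement with the remark above.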

The next part of this chapter will prove a theorem enabling us to
show that the above Shellsorts take at most O(n^{3/2}) units of time,
provided their sequences have the coprimeness property. The last part
will prove a theorem applicable to Shellsorts whose sequences are fuzzy
geometric progressions with integer common ratios, enabling us to prove
that the O(n^{3/2}) figure cannot be improved in such a case.
2.2 An Upper Bound for Most Shellsorts

The first result is essentially a generalization of Papernov and
Stasevich's theorem [1965] that Shellsort with Hibbard's sequence takes
at most O(n^{3/2}) units of time. The basic properties we shall impose
on the class of sequences covered by the result are that they approximate
geometric sequences and that every d consecutive elements in the latter
part of the sequence form a coprime set of integers, for some d.

We  shall  need  in  advance  some  auxiliary  lemmas.  The  first  is  the 
"non-messing-up"  theorem  for  p-sorting  and  q-sorting. 

Lemma 2.1. Given positive integers p and q, and a p-ordered array A
with n elements, q-sorting A leaves it p-ordered.

Proof. (This is a slight modification of a proof in [Boerner, 1955, p. 13?].)

Let j be such that A[j-p] > A[j] after q-sorting. We shall give
one method of q-sorting which contradicts this, whence it follows that all
methods contradict this, since the outcome of q-sorting is unique.

Let A[j-p], A[j] belong to q-chains B and C respectively. Now
B and C must be distinct, otherwise A[j-p] ≤ A[j] because each
q-chain is ordered.

Call the least element of A, -∞, and the greatest, ∞.

Sort all the q-chains except B and C. Now put B and C into
correspondence, with A[j-p] corresponding to A[j], A[j-q-p] to
A[j-q], A[j+q-p] to A[j+q], etc. If necessary, extend B and C to
ensure that every element has a mate, using -∞ for B and ∞ for C.
Call the extended q-chains B' and C'. We now have the situation of
Figure 2.1, as the reader may check. (Here (a,c) and (b,d) are two
instances of corresponding elements. Lower valued subscripts of A
correspond to elements closer to the top of the figure.)


Figure 2.1. Corresponding q-chains B' and C'.

Now sort B' and C' thus:

1.  Use a sorting algorithm which sorts every array of a given size n
    by using a fixed sequence of pairs (i,j) depending only on n and
    drawn from [1,n] × [1,n]. For each such pair, it puts A[i] and
    A[j] in order. The insertion sort of Chapter 1, with the while
    condition deleted, is such an algorithm.

2.  For each pair (A[i],A[j]) in B' ordered by this algorithm,
    simultaneously order the corresponding elements (A[i+p],A[j+p])
    in C'. Thus B' and C' are sorted in parallel.
Let a,b in B' and c,d in C' be four elements participating
in one step of this algorithm, with a,b,c,d in the order shown in
Figure 2.1. Suppose before the step, we had a ≤ c and b ≤ d. We
claim that after the step, the two resulting corresponding pairs will
still be ordered. This is trivially true if neither or both of (a,b)
and (c,d) are swapped. If only (a,b) is swapped, we must have had
b < a ≤ c ≤ d before, and if only (c,d) is swapped, then we had
a ≤ b ≤ d < c. In both cases, both elements of B' are less than or
equal to both elements of C', proving the claim.

Since corresponding pairs are ordered at the start, they must
therefore be ordered at the end, by induction on the steps of the
algorithm. Now the extensions clearly cannot have moved, so they may
be removed. The result is just as if we had q-sorted B and C. But
now A[j-p] ≤ A[j]. This contradiction completes the proof.
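Lemma 2.1 is easy to test empirically. The following Python sketch p-sorts and then q-sorts random arrays and checks that the result is still p-ordered; the function names, the random test data, and the use of Python's built-in `sorted` on each chain (legitimate because the outcome of p-sorting is unique) are all assumptions made for illustration.

```python
import random


def p_sort(a, p):
    """Sort every p-chain of a in place; any sorting method gives the same result."""
    for j in range(min(p, len(a))):
        a[j::p] = sorted(a[j::p])


def is_p_ordered(a, p):
    """True when every p-chain of a is in ascending order."""
    return all(a[i] <= a[i + p] for i in range(len(a) - p))


def check_lemma_2_1(n, p, q, trials=200, seed=0):
    """Return True if q-sorting never destroyed p-order in the random trials."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.randrange(n) for _ in range(n)]
        p_sort(a, p)
        p_sort(a, q)
        if not (is_p_ordered(a, p) and is_p_ordered(a, q)):
            return False
    return True
```

A random test of this kind does not prove the lemma, but a failure would indicate an error in the implementation.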

An immediate corollary is that if an array is p_1-sorted, then
p_2-sorted, and so on up to p_k, it is then p_i-ordered for i = 1, 2, ..., k.

If the diophantine equation

    p_1 x_1 + p_2 x_2 + ... + p_k x_k = q,   all p_i > 0,          (1)

has non-negative solutions in the x_i, then an array p_i-ordered for
these p_i's is also q-ordered, by the transitivity of the ordering
relation, since the solution indicates the existence of a sequence

    A[j] ≤ A[j+p_1] ≤ A[j+2p_1] ≤ ... ≤ A[j+x_1 p_1] ≤ A[j+x_1 p_1+p_2] ≤ ...
         ≤ A[j+p_1 x_1+...+p_k x_k] = A[j+q],   for all j.
Lemma 2.2. When gcd(p_1, p_2, ..., p_k) = 1, and
q ≥ p_m(p_1 + p_2 + ... + p_k - p_m) for some p_m, equation (1) always
has non-negative integer solutions in the x_i.

Proof. It is well known that when gcd(p_1, p_2, ..., p_k) = 1, the
diophantine equation (1) always has a solution in x_1, ..., x_k. The
set of possible solutions must be closed under the operation of
simultaneously adding p_j to x_i and subtracting p_i from x_j, since this
adds (p_i p_j - p_j p_i) = 0 to the left-hand side of the equation. Thus
there must be a solution in which for all i ≠ m, 0 ≤ x_i < p_m, since each
x_i other than x_m can be adjusted by increments of p_m, at a cost to x_m.

But now we have

    p_m((p_1 + p_2 + ... + p_k) - p_m)
        > (p_1 x_1 + p_2 x_2 + ... + p_k x_k) - p_m x_m    (i ≠ m implies x_i < p_m)
        = q - p_m x_m                                      (equation (1))
        ≥ p_m(p_1 + p_2 + ... + p_k - p_m) - p_m x_m       (hypothesis)

from which it follows that x_m > 0 in this particular solution.

Q.E.D.
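As a sanity check on Lemma 2.2, the following Python sketch tests by brute force whether every q at or above the stated threshold is representable as a non-negative combination of the p_i. The helper names and the finite search window `span` are assumptions; the check illustrates the lemma on examples and is not a proof.

```python
from functools import reduce
from math import gcd


def representable(q, ps):
    """True when q = sum(p_i * x_i) for some non-negative integers x_i."""
    reach = [True] + [False] * q  # reach[t]: t is a non-negative combination
    for total in range(1, q + 1):
        reach[total] = any(total >= p and reach[total - p] for p in ps)
    return reach[q]


def lemma_2_2_threshold(ps, m):
    """p_m * (p_1 + ... + p_k - p_m), with m a 0-based index into ps."""
    return ps[m] * (sum(ps) - ps[m])


def check_lemma_2_2(ps, span=200):
    """Check representability for `span` values of q from each threshold."""
    assert reduce(gcd, ps) == 1
    return all(
        representable(q, ps)
        for m in range(len(ps))
        for q in range(lemma_2_2_threshold(ps, m),
                       lemma_2_2_threshold(ps, m) + span)
    )
```

For p = (3, 5) the threshold is 15 for either choice of p_m, whereas 7 is not representable, so the hypothesis on q cannot simply be dropped.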

We may infer from Lemma 2.2 that if an array A is p_i-ordered
for p_1, ..., p_k, and gcd(p_1, p_2, ..., p_k) = 1, A is q-ordered for all
q ≥ p_m(p_1 + p_2 + ... + p_k - p_m), where p_m is any one of the p_i's. Thus
an upper bound on the number of elements of a p-chain B which may
precede and be greater than a given element of B is
p_m(p_1 + p_2 + ... + p_k - p_m)/p, since the elements of a p-chain are
spaced p apart. Hence to p-sort A requires at most
n p_m(p_1 + p_2 + ... + p_k - p_m)/p exchanges.
Call this the first upper bound.

Within a p-chain (of at most ⌈n/p⌉ elements), the average element
can participate in at most (1/2)(⌈n/p⌉ - 1) exchanges during p-sorting.
So the total number of exchanges required is at most (1/2) n (⌈n/p⌉ - 1).
Hence n^2/2p is also a (larger) upper bound, the second upper bound.
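To get a feel for the two bounds, they can be written as small Python functions. The names and the argument `earlier` (the increments p_1, ..., p_k for which the array is already ordered) are assumptions; since the text allows p_m to be any one of the p_i, the sketch takes the smallest resulting value.

```python
def first_upper_bound(n, p, earlier):
    """n * p_m * (sum(earlier) - p_m) / p, minimised over the choice of p_m.

    `earlier` must have gcd 1 and the array must be ordered for each of them.
    """
    total = sum(earlier)
    return min(n * pm * (total - pm) / p for pm in earlier)


def second_upper_bound(n, p):
    """n**2 / (2p), valid for any array."""
    return n * n / (2 * p)
```

For n = 1000, an array already 7-ordered and 3-ordered, and a final pass with p = 1, the first bound gives 21000 exchanges against 500000 for the second.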

Before proceeding with the formalism of the main result, let us
provide some insight into what is going on. The two upper bounds we
have just derived are about to be used to bound the time required by
Shellsort using a characteristic sequence having properties shared by
most of the suggested sequences, excluding those for which Shellsort is
clearly an O(n^2) algorithm. The properties we require of a sequence S
are (for the moment): that there is a sequence S' such that to each
element p of S there corresponds an element p' of S', with fixed
bounds on p-p' ("additive fuzziness", namely -a ≤ p-p' ≤ b, for fixed
a,b); and that each element of S' is between r and s times its
successor, for fixed r,s > 1. That is, S' is a sloppy decreasing
geometric progression ("multiplicative fuzziness").

These conditions are general enough to cover most sequences that
could be called "fuzzy" geometric progressions.

We also impose a coprimeness condition on neighboring elements of
the sequence, to satisfy the conditions required for the first upper bound.
For some integer d independent of n, every d consecutive elements must be
relatively prime.


We shall use the first upper bound to bound the time spent by
Shellsort when p-sorting for small p. The second upper bound is for
large p. While the latter remark makes sense (n^2/2p is small for
large p), the former may seem not to, at first, since p appears in
the denominator of the first upper bound also. However, the numerator
is an O(p^2) expression, and the p_i's will be just those elements in
the sequence that immediately precede p. Our conditions ensure that
these decrease in approximate proportion to p, so the first upper bound
is really an O(p) expression rather than an O(1/p) one.

Because we do not need the first upper bound for large elements
of S, we shall actually restrict the condition that S be a fuzzy
coprime geometric progression to the small elements of S. We impose
a much weaker condition on the large elements of S, that the sum of
their reciprocals be an O(1/√n) quantity. (This condition is readily
seen to hold for the first half of a fuzzy geometric progression with
all elements less than n, since the smallest element in that half is
itself an O(√n) quantity.)

We  now  proceed  with  the  formalism. 


Theorem 2.4. Let r,s,t,u,v be reals, with r,s > 1 and t,u,v > 0.
Let a,b,d be integers, with a,b ≥ 0, d ≥ 2.

To each array size n, suppose there corresponds a sequence
p_1, p_2, ..., p_m and an index c (denoting the cut-off point p_c)
such that

(i)    p_m = 1  (to ensure that Shellsort really sorts)

(ii)   Σ_{1≤j<c} 1/p_j ≤ u/√n  (the large p_j, for the second upper
       bound)

(iii)  c > d  (so that the first upper bound is usable for
       elements p_c, p_{c+1}, ..., p_m)

(iv)   p_c ≤ t√n  (to keep small those elements covered by
       the first upper bound, in conjunction with
       condition (vii))

(v)    j ≥ c implies gcd(p_{j-1}, ..., p_{j-d}) = 1  (for the first
       upper bound)

(vi)   There is a sequence S' = p'_c, p'_{c+1}, ..., p'_m in which
       -a ≤ p_i - p'_i ≤ b, for i = c, ..., m-1 and p'_m = v.

(vii)  In S', r p'_{i+1} ≤ p'_i for i = c, ..., m-1.

(viii) In S', p'_i ≤ s p'_{i+1} for i = c, ..., m-1.

Then with these conditions, Shellsort takes O(n^{3/2}) units of time.


Proof. Applying the second upper bound is easy. The total time
required for p_j-sorting, for j < c, is at most

    Σ_{1≤j<c} n^2/(2p_j)       (using the second upper bound)

        ≤ (1/2) u n^{3/2}      (by condition (ii))

The  remainder  of  the  sequence  requires  a  little  more  work.  However, 
the  underlying  idea  remains  simple,  that  the  first  upper  bound  decreases 
approximately  geometrically  as  Shellsort  progresses  through  the  sequence, 
and  hence  the  total  time  required  is  proportional  to  that  required  for 
p_c-sorting alone.
First note that

    p_i ≤ p'_i + b                         (condition (vi))
        ≤ s^k p'_{i+k} + b,   k ≥ 0        (condition (viii))
        ≤ s^k p_{i+k} + s^k a + b          (condition (vi))        (2)

and

    p_i ≥ p'_i - a                         (condition (vi))
        ≥ r^k p'_{i+k} - a,   k ≥ 0        (condition (vii))
        ≥ r^k (p_{i+k} - b) - a            (condition (vi))        (3)

Also

    p_{i+k} ≤ (p_i + a)/r^k + b            (by inequality (3))     (4)

The total time required for p_j-sorting, for j ≥ c, is at most

    Σ_{c≤j≤m} n p_{j-1}(p_{j-2} + ... + p_{j-d}) / p_j
                    (the first upper bound, and conditions (iii), (v))

    ≤ Σ_{c≤j≤m} n (s p_j + sa + b)[(s^2 p_j + s^2 a + b) + (s^3 p_j + s^3 a + b)
                    + ... + (s^d p_j + s^d a + b)] / p_j
                    (by inequality (2))

<  Z  (p.+»b) 

c<j<u  pj  y  J 

2  d  r* 

<  (s  + . .  ,+  s  )n  Z-  ( sp  +  sa+b+  ( 2sa+b)  ( a+b) )  (since  p.  >  1  ; 

1  <p  <t  /n  J  J  - 


(since  s  >  1) 


also  note  use  of 
condition  (iv)) 


<  (s2+. 


•  .+sd>n  Z  f d4/y»)  „ 

l<rk<  V.  r  y 


(K  -  (c(2a+l)+b)(a+b))  ;  s  >  1 

and  using  inequalities  (}),  (it)) 


lit 


(summing  the 
geometric 
progression) 


=  (s2+...+  s^)n  ^  (srkv+K) 

a+1  ^  k  „  t  /n+a 
v  -  v 


Hence  the  total  time  required  for  Shellsort  is  less  than 


This completes the proof of the O(n^{3/2}) upper bound result for
Shellsort with this class of characteristic sequences.

The reader may readily calculate the values of r, s (both 2 in
the sequences of Chapter 1), u (a function of t, clearly, as well
as of r, s, a and b), a, b and v for various sequences, and
may amuse himself determining the value of t that minimizes the bound
in each case.

For the case of Hibbard's sequence, for example, where
p_i = 2^{⌊log_2 n⌋ - i + 1} - 1, take r = s = t = v = d = 2, u = a = 1,
b = 0, c = ⌊log_2 n⌋ - (1/2) log_2 n. Condition (i) is satisfied since
Hibbard's sequence contains 1. Condition (ii) is satisfied since
p_c = 2√n - 1 > √n (for n > 1), and p_i = 2p_{i+1} + 1. Condition (iii)
holds for n > /0. Condition (iv) holds since p_c = 2√n - 1 (see above).
Condition (v) holds since (p_i, p_{i+1}) = 1 trivially. Conditions (vi) to
(viii) hold if p'_i is taken to be p_i + 1. Thus our theorem is true
for all n > 52. Making the substitutions, the dominant term of our
upper bound is 52.5 n^{3/2}.




2.3 Optimality of the O(n^{3/2}) Bound

In this section we shall construct arrays that take time proportional
to n^{3/2} to sort using Shellsort with sequences that approximate a
geometric progression with integer common ratio. Most of the proposed
sequences to date have this property.

The  basic  tool  for  the  construction  is  a  sorting  template. 

(Visualize  this  as  a  strip  of  cardboard  with  some  holes  in  a  straight 
line;  the  elements  of  the  template  are  the  hole  locations,  numbered 
right  to  left.) 

Definition 2.1. A sorting template is a set of natural numbers
containing 0 and closed under addition.

Definition 2.2. The sorting template generated by a set is the least
sorting template containing that set.

For example, the sorting template generated by {1} is the set N
of natural numbers, while that generated by {2,5} is N - {1,3}.
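A generated template is infinite, but any finite initial part of it can be computed. The following Python sketch does this; the name `generated_template` and the cut-off parameter `limit` are assumptions introduced for illustration, and the generators are taken to be positive integers.

```python
def generated_template(generators, limit):
    """Elements <= limit of the sorting template generated by `generators`."""
    template = {0}
    frontier = [0]
    while frontier:
        x = frontier.pop()
        for g in generators:
            y = x + g
            # Closure under addition: every sum of generators belongs to T.
            if y <= limit and y not in template:
                template.add(y)
                frontier.append(y)
    return template
```

With generators {2, 5} and limit 20, the only natural numbers up to 20 that are missing are 1 and 3, matching the example above.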

Definition 2.3. An array element A[i] is visible through a sorting
template T at j when j-i is in T.

(Visualize A as being written on a sheet of paper underneath T,
with subscripts numbered from left to right. The zero hole of T is
over A[j].)
Definition 2.4. An array A is constructed under a sorting template T
at a sequence q[1], ..., q[m] thus:

    ℓ := 0;
    for i := 1 until m do
        for j := 1 until n do
            if A[j] is undefined and A[j] is visible through T at q[i]
            then begin ℓ := ℓ+1; A[j] := ℓ end;

Note that each element of A is initially undefined, and becomes
defined by assigning ℓ to it, after which it is defined.

Intuitively,  we  put  the  template  down  with  the  zero  hole  of  the 
template  on  A[q[l]]  ;  then  we  move  the  template  to  A[q[2] ]  and  so  on. 

At  each  position,  we  fill  in  all  the  visible  but  as  yet  undefined 
elements  of  A  ,  using  ever-increasing  numbers.  Some  of  the  language 
we  employ  later  assumes  that  this  intuitive  view  has  been  grasped. 

Lemma 2.5. If p ∈ T, then an array A constructed under the sorting
template T is p-ordered. (Hence the name sorting template.)

Proof. Say A[j] becomes defined when T is at q. Then q-j ∈ T.
So q-(j-p) ∈ T also, since p ∈ T and T is closed under addition.
Thus A[j-p] must be assigned its value before A[j], whence
A[j-p] < A[j].

Q.E.D.
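The construction of Definition 2.4 and the conclusion of Lemma 2.5 can be tried out with a short Python sketch. The function names, the finite cut-off used to represent the template, and the particular choice n = 10, T generated by {2, 5}, positions 10 then 9, are all illustrative assumptions.

```python
def generated_template(generators, limit):
    """Elements <= limit of the sorting template generated by `generators`."""
    template, frontier = {0}, [0]
    while frontier:
        x = frontier.pop()
        for g in generators:
            if x + g <= limit and x + g not in template:
                template.add(x + g)
                frontier.append(x + g)
    return template


def construct(n, template, positions):
    """Definition 2.4 with 1-based subscripts; index 0 is unused padding."""
    a = [None] * (n + 1)
    label = 0
    for q in positions:
        for j in range(1, n + 1):
            # A[j] is visible through T at q when q - j is in T.
            if a[j] is None and (q - j) in template:
                label += 1
                a[j] = label
    return a[1:]


def is_p_ordered(a, p):
    """True when every p-chain of a is in ascending order."""
    return all(a[i] <= a[i + p] for i in range(len(a) - p))


T = generated_template({2, 5}, 10)
A = construct(10, T, [10, 9])
```

The resulting array 1, 2, 3, 4, 5, 6, 9, 7, 10, 8 is 2-ordered and 5-ordered, as the lemma predicts, yet it still contains inversions.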

Let us use the notation [a,b] to denote (the set of integers in)
the interval from a to b inclusive. By the length of [a,b] we
shall mean b-a+1. By A < B, for intervals A and B, we
shall mean every element of A is less than every element of B.
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Let  us  now  give  an  informal  preview  of  the  formalities  to  follow. 

Our  goal  is  to  be  able  to  construct  arrays  that  take  time  n^2  to 
Shellsort . 

As the preceding section showed, p-sorting for p near the
beginning and end of sequences takes only linear time; only near the
middle can an additional factor of $\sqrt{n}$ creep in to spoil things. Thus
if we are going to find arrays that take time $n^{3/2}$, we should arrange
things so that Shellsort finds the going toughest halfway through the
sequence. To do this, we shall construct arrays that look as if Shellsort
is already halfway through sorting them, and yet that have many
inversions. Thus when sorting these arrays, Shellsort will zip through
the first half of the sequence finding nothing to do, and then suddenly
hit a brick wall, and take time $n^{3/2}$ in a single p-sorting pass. We
do not much care what happens for the rest of the sequence.

The preceding definitions and lemmas established the basic tools
for the construction. The following lemmas establish some quantitative
results of use in the analysis of the actual construction, which is
described in the first paragraph of the Proof of Theorem 2.11.

Lemma 2.6. The sorting template T generated by [a,b] is
$\bigcup_{k \ge 0} [ka,kb]$.

Proof. The union certainly contains {0} and [a,b]. To see that the
union is closed under addition, take m,n such that $k_1 a \le m \le k_1 b$
and $k_2 a \le n \le k_2 b$. Then $(k_1+k_2)a \le m+n \le (k_1+k_2)b$. Thus the
union is a sorting template containing [a,b].

To see that it is the least such sorting template, suppose m is
the least integer which is in the above union, but is not in every other
sorting template containing [a,b]. Then $(k+1)a \le m \le (k+1)b$ for
some $k \ge 1$. But every number in this range is expressible as the sum of
two numbers from [a,b] and [ka,kb] respectively. This contradicts
the closure property of the template lacking m.
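Lemma 2.6 can be checked numerically for small cases. The following is a minimal sketch, not part of the original argument; the function names are hypothetical and the check assumes $a \ge 1$ and looks only at the integers up to a chosen `limit`.

```python
def closure_under_addition(a, b, limit):
    """Integers in [0, limit] that are sums of zero or more members of [a, b].
    This is the template generated by [a, b], truncated at `limit`."""
    reachable = [False] * (limit + 1)
    reachable[0] = True
    for x in range(1, limit + 1):
        reachable[x] = any(
            x - g >= 0 and reachable[x - g] for g in range(a, b + 1)
        )
    return {x for x in range(limit + 1) if reachable[x]}


def union_of_intervals(a, b, limit):
    """The union of [ka, kb] for k >= 0, truncated at `limit` (needs a >= 1)."""
    result = set()
    k = 0
    while k * a <= limit:
        result.update(range(k * a, min(k * b, limit) + 1))
        k += 1
    return result
```

Comparing the two sets for a few pairs (a,b) gives a quick sanity check of the lemma, though of course not a proof.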

Lemma 2.7. If T is generated by [a,b] and a < b, then $i \in T$
for all $i \ge a^2/(b-a)$.

Proof. The complement of T, $\bar{T}$, is the set
$[1,a-1] \cup [b+1,2a-1] \cup \dots \cup [kb+1,(k+1)a-1] \cup \dots$, by Lemma 2.6.
When $kb+1 \ge (k+1)a$, these intervals vanish. This happens for
$k \ge (a-1)/(b-a)$. Thus the largest possible element of $\bar{T}$ is
$((a-1)/(b-a))a-1$, which is certainly less than $a^2/(b-a)$.

Lemma 2.8. If T is generated by [a,b], a < b, then for any
non-negative integer p < a there is an interval I of length p in $\bar{T}$
such that T has exactly $\lceil (a-p)/(b-a) \rceil$ intervals of the form [ka,kb],
$k \ge 0$, which are less than I.

Proof. Write $m = \lceil (a-p)/(b-a) \rceil$. Choose I of length p and lying
in $[(m-1)b+1,\ ma-1]$. This latter interval lies in $\bar{T}$ since it is of
the form [kb+1,(k+1)a-1] (see proof of Lemma 2.7), and is of length
$m(a-b)+b-1$, which is certainly not less than p. Now take the m
intervals of T to be [0,0], [a,b], [2a,2b], ..., [(m-1)a,(m-1)b], all of
which are clearly less than I.


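Lemma 2.9. With T, I as in Lemma 2.8, the number of elements of T
which are less than any element of I is at least
$\frac{1}{2}(a-p)\left(\frac{a-p}{b-a} - 1\right)$.

Proof. We shall sum just the complete intervals [ka,kb] of T
for $0 \le k < \lceil (a-p)/(b-a) \rceil$, counting k(b-a) elements in each:

$$\sum_{0 \le k < m} k(b-a) = \tfrac{1}{2}(b-a)\,m(m-1), \qquad m = \left\lceil \frac{a-p}{b-a} \right\rceil ,$$

which is at least the stated bound because $m \ge (a-p)/(b-a)$.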

Lemma 2.10. Let $c \in T$ and let A be constructed under T at
c,2c,3c,...,mc, for some m. Then if some A[j] is visible through
T at ic, it is visible through T at kc for all k > i. That is,
once visible, always visible. Conversely, every invisible element must
be undefined.

Proof. If A[j] is visible through T at ic, then $ic-j \in T$.
But $c \in T$, whence $kc-j \in T$, by closure of T under addition. Thus
A[j] is visible through T at kc.

We  are  now  ready  for  the  main  theorem. 

Theorem 2.11. Let r,s be reals greater than 1. Let a,b be
non-negative integers. Then there exist non-negative reals
t,u,c, with rt < u, such that

if for every array size n Shellsort uses a sequence
$S_n = p_1,\dots,p_m$ with the properties that

(i) there is some element p in $S_n$ such that $t\sqrt{n} < p < u\sqrt{n}$
and for all $p_j$ preceding p, there is an integer m
for which $m(p-a) \le p_j \le m(p+b)$; and

(ii) the successor of p in $S_n$, q, satisfies
r(q-b)-a < p < s(q+a)+b and also q < p-a;

then Shellsort takes at least $c\,n^{3/2}$ units of time on some arrays
of length n.

Proof. Construct A under the sorting template T generated by the
integers in [p-a,p+b] at the sequence p,2p,3p,...,gp, where g is
large enough to ensure that A is completely defined. By Lemma 2.7,
every element of A is visible through T beyond $n+(p-a)^2/(a+b)$,
whence it suffices to take $g = \lceil n+(p-a)^2/(a+b) \rceil$.

We now show that A takes $c\,n^{3/2}$ units to Shellsort.

For the $p_k$ in $S_n$ up to and including p, the conditions of
the theorem ensure that $p_k$ is in T. Thus after $p_k$-sorting, for all
$p_k$ preceding q, A is left unchanged (Lemma 2.5). So the number of
q-inversions in the original A gives a lower bound on how fast A will
be sorted. We shall give a lower bound on the number of these inversions.

Initially T is at p in the construction, whence the first interval
of T "covers" some of the first p elements of A. As T advances
by p units each time, exactly one new interval of T (which of course
must move as T does) appears, to cover elements of A. Eventually
the I of Lemma 2.8 (of length q in this case) appears. There are
$\lfloor n/p \rfloor$ disjoint contiguous sequences of elements of A of length p such
that as T progresses, I will partly cover each such sequence in turn.
Hence there must be at least $\lfloor n/p \rfloor$ positions of T during this
construction for which q elements of A are covered by I.

During this process there is an f such that when T is at fp,
interval I covers some contiguous set V of elements of A within
A[1] to A[p]. By Lemma 2.10, the elements of V must be undefined
at this time. By the construction, the set D of those elements of A
visible through T at fp and that lie to the right of V must be or
become defined at this time. Hence the value of each element d in D
must be less than that of each element of V (since the latter becomes
defined later) and in particular less than one that is in d's q-chain,
since every q-chain must have a representative in any interval of length q.
Thus there are at least |D| inversions within the q-chains of A.

From this time on, as T is advanced, new elements of A become
covered by I, resulting in |D| more q-inversions by the same argument.
Eventually I will be at some point beyond n, and the number of inversions
contributed in this way for T at each subsequent position will start to
decrease. After a number of such advances of T, I "falls off" the right
hand end of A.

We might start to count the number of q-inversions in A by
multiplying |D| by $\lfloor n/p \rfloor$ since the q-inversions are all distinct
(because T moves further than q places each time). However, this
would then include those q-inversions whose right-hand members lie
outside A. So we need an upper bound on the number of those q-inversions,
which we shall then subtract from $|D| \cdot \lfloor n/p \rfloor$ to give a lower bound on the
number of q-inversions in A.

At the first position of T at which we start to lose q-inversions
to this effect, we lose q-inversions corresponding to just that element
in the first interval of T, namely the zero element. At the next
position we lose at most those q-inversions corresponding to the zero,
and the interval [p-a,p+b], a total of 1+(b+a+1). At the k-th
position, we lose a number of q-inversions bounded by

$$\sum_{0 \le j < k} (j(b+a)+1).$$

This process stops as soon as the interval I of T leaves the array.
Hence k may be bounded by the result of Lemma 2.8. So the subtraction
term is at most


2k 


£  (J(Wa)U) 


0  <j  <k 


r*  t  2  i 

/  .  (±(a+b)k  +  k+l)  ,  where  k  =  (k(k-l) . . .(k-i+1) 


xs^r^i 


f  rP-a-q-i 

1  3  2  I  a+b  I  y  4  1 

=  7  (a+b)  k  +  3k  +  6k  ,  since  k  is  -r 

b  |_  J1  a  <k  <b  J 


A 


"I  Eli? -a 

<  i  (a+b)  k5+5k  a*b 

L  J° 


- 1  <«*»)(  (^^  * 


4  • 


|D| is given by Lemma 2.9, namely
$\frac{1}{2}(p-a-q)\left(\frac{p-a-q}{a+b} - 1\right)$. Hence
the number of inversions in q-chains is at least


|  <S  -  !)(«-.)(“?  -  1)  -  |  <*b)««g£>3  *  5 


>-  -  (r-l)  /n  +  -  +  b-a 

>  |4-l)(|(r-U/»  -£ - ^ - 


1, 

‘  T,  <atb>  - a+b - 


*  s  a  b  j  +  f  5  (a-l)^n  + 


-  -a+b 
There is a term in $n^{3/2}$ in the above, namely


•  (i(r-l))2  /  (»b)  -  £  (a»b)(H^)Jj  „J/2  , 

This coefficient is not always positive. (Consider arbitrarily
large u.) However, the two main terms of the coefficient are always
positive, since r,s > 1. The first term is proportional to $t^2/u$,
the second to $u^3$. Thus one can choose suitably small t and u to
ensure that the first term exceeds the second, so that their difference
is then positive.

This  completes  the  proof. 

In the case of Shellsort with Hibbard's sequence, the parameters
are r = s = 2, a = 0, b = 1.

To verify that Hibbard's sequence satisfies the conditions of the
theorem with these parameters, note that for any $p_k$ in the sequence, the
interval $[p_k-a, p_k+b]$ is $[2^k-1, 2^k]$ for some k, and thus every larger
element $2^l-1$ is contained in the interval $[m(2^k-1), m2^k]$ where
$m = 2^{l-k}$. Indeed, this condition will be satisfied for any "fuzzy"
geometric progression with r being an integer > 1 and s = r. The
other conditions are easily verified.
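The containment claimed for Hibbard's sequence is easy to check mechanically. The sketch below is only an illustration; the function name and the exponent bound are assumptions made for the example.

```python
def hibbard_condition_holds(max_exponent):
    """Check, for all 1 <= k < l <= max_exponent, that the larger element
    2^l - 1 lies in [m(2^k - 1), m 2^k] with m = 2^(l-k)."""
    for k in range(1, max_exponent + 1):
        for l in range(k + 1, max_exponent + 1):
            m = 2 ** (l - k)
            larger = 2 ** l - 1
            if not (m * (2 ** k - 1) <= larger <= m * 2 ** k):
                return False
    return True
```

The check succeeds because $m(2^k-1) = 2^l - 2^{l-k}$, which is at most $2^l - 1$ whenever l > k.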


Chapter 3

An n log n Shellsort


We show in this chapter that, using the sequence $2^p 3^q < n$,
p,q non-negative integers, Shellsort takes $\frac{1}{2} n(\log_2 n)(\log_3 n)$ steps,
and also admits of a simplification in which the innermost loop can be
replaced by one instruction.

Let  us  establish  some  properties  of  arrays  that  are  both  2-ordered 
and  3-ordered. 

Lemma 3.2. If p > 1, then p is representable as a sum of 2's
and 3's; that is, there are non-negative integers r,s, with
p = 2r + 3s.

Proof. If p is even, then p = 2(p/2). Otherwise, p = 2((p-3)/2) + 3.

Corollary 3.3. If A is 2-ordered and 3-ordered, it is p-ordered for
all p > 1.

Proof. Follows from Lemma 3.2 and the transitivity of $\le$.

Lemma 3.4. If A is 2-ordered then for all j satisfying 1 < j < n,
either $A[j-1] \le A[j]$ or $A[j] \le A[j+1]$.

Proof. If not, then A[j-1] > A[j] > A[j+1], a contradiction.


An immediate corollary is that if A is p-ordered for all p > 1,
no element A[j] may be a member of more than one inversion, and every
inversion must involve two adjacent elements. Thus to sort a 2-ordered
and 3-ordered array, it suffices to swap the inverted adjacent pairs,
which can be done in one pass, during which each of the n-1 adjacent
pairs is compared and exchanged if necessary.
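The one-pass claim can be checked exhaustively for small n. The sketch below is illustrative only (the function names are hypothetical): it enumerates every permutation of n distinct keys, keeps those that are both 2-ordered and 3-ordered, and confirms that a single left-to-right pass of adjacent compare-exchanges sorts each of them.

```python
from itertools import permutations


def is_h_ordered(a, h):
    """True when a[i] <= a[i+h] for every valid i."""
    return all(a[i] <= a[i + h] for i in range(len(a) - h))


def one_pass(a):
    """Compare and, if necessary, exchange each of the n-1 adjacent pairs once."""
    a = list(a)
    for i in range(len(a) - 1):
        if a[i] > a[i + 1]:
            a[i], a[i + 1] = a[i + 1], a[i]
    return a


def check_one_pass(n):
    """Return (count, ok): how many permutations of n keys are 2- and
    3-ordered, and whether one pass sorted every one of them."""
    count, ok = 0, True
    for perm in permutations(range(n)):
        if is_h_ordered(perm, 2) and is_h_ordered(perm, 3):
            count += 1
            ok = ok and one_pass(perm) == sorted(perm)
    return count, ok
```

For n = 4 there are five such permutations: the identity and the four obtained from it by exchanging disjoint adjacent pairs.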

All of the above applies equally well to the p-chains of an array.

In particular, when A is 2p-ordered and 3p-ordered, its p-chains are
2-ordered and 3-ordered, which means we can p-sort as above, and only
take n-p comparisons and exchanges.

Applying all this to Shellsort, we then deduce that any sequence
will serve our purposes if (a) it contains 1, and (b) if p is in
the sequence p is preceded by 2p and 3p. Furthermore, we only need
use those elements less than n. The smallest such sequence contains
every number of the form $2^p 3^q < n$, for integer $p,q \ge 0$. The inequality
$2^p 3^q < n$ can be written as $p/\log_2 n + q/\log_3 n < 1$, an inequality
linear in p and q.

Estimating the length L of this smallest sequence is made easy
using a geometric argument. In Figure 3.1, associate the lattice point
(p,q) with the element $2^p 3^q$. The three bounds $p \ge 0$, $q \ge 0$ and
$p/\log_2 n + q/\log_3 n < 1$ define the three sides of a triangle.

Figure 3.1. A triangular array of elements of $\{2^p 3^q \mid p,q \ge 0\}$

In Figure 3.1, the p and q intercepts of the sloping line are
$\log_2 n$ and $\log_3 n$ respectively, whence the area of the triangle is
$\frac{1}{2}\log_2 n \log_3 n$, a first approximation to L which we improve thus:

We claim that the interior of the triangle is completely covered
by those unit squares whose lower left vertices are lattice points in or
on the triangle (other than on its hypotenuse). For suppose (x,y) is
in the triangle. Then (x,y) is covered by the square whose lower left
vertex is $(\lfloor x \rfloor, \lfloor y \rfloor)$. But $(\lfloor x \rfloor, \lfloor y \rfloor)$ is easily seen to be in
or on the triangle but not on the hypotenuse (since (x,y) is not on
the hypotenuse). This proves the claim.

But those lattice points in the claim correspond exactly to the
elements of our sequence. Hence there are at least $\frac{1}{2}(\log_2 n)(\log_3 n)$
elements in the sequence, since the area of the squares corresponding to
these elements exceeds the area of the triangle. So our first approximation
turned out to be in fact a lower bound.

This bound is far from attainable, since the boundary of the squares
is very ragged near the hypotenuse, as in Figure 3.2(a).


Figure  3.2(a).  A  triangle  covered  by  squares. 

(b).  A  triangular  extension. 

One way to improve the bound is to give the triangle a more ragged
hypotenuse. Take copies of the triangle of Figure 3.2(b) and paste
them over the dark regions of Figure 3.2(a). We need $\lfloor \log_3 n \rfloor$ such
triangles, each of area $\frac{1}{2}\log_2 3$, giving a total area of more than
$\frac{1}{2}(\log_3 n - 1)\log_2 3$, which is more than $\frac{1}{2}\log_2 n - 0.8$.

To see that the interiors of these triangles are all covered by the
squares, note that if (x,y) is in some small triangle t, then
$(\lfloor x \rfloor, \lfloor y \rfloor)$ must be on the horizontal lattice line passing through the
lower right vertex v of t, and must be to the left of v, and hence
inside the main triangle.

Hence the area of the squares, and thus the number of elements of the
sequence, is bounded below by $\frac{1}{2}\log_2 n(\log_3 n + 1) - 0.8$.

To get an upper bound we use a very similar argument. First
replace the hypotenuse by one almost parallel to it:

$p/\log_2(n-1) + q/\log_3(n-1) = 1$.

(This uses the fact that $2^p 3^q \le n-1$, and simply gives us a marginally
tighter upper bound.) Now associate with each lattice point P in the
new triangle or on its hypotenuse (but not on the p or q axes, and
hence not on the endpoints of the hypotenuse) the unit square whose upper
right vertex is P. These squares are all clearly inside the triangle.

We left out $\lfloor\log_2(n-1)\rfloor + \lfloor\log_3(n-1)\rfloor + 1$ points on the axes (the "1"
is the origin). Hence there are less than
$\frac{1}{2}\log_2(n-1)\log_3(n-1) + \lfloor\log_2(n-1)\rfloor + \lfloor\log_3(n-1)\rfloor + 1$ elements in the
sequence.

The reader may check that the earlier "ragged" argument still works,
this time using the reflection of the triangle of Figure 3.2(b),
and removing copies of this triangle from the main triangle. (The
hypotenuses of the small triangles still coincide with the hypotenuse of
the main triangle.) Hence we may subtract $\frac{1}{2}\log_2 n - 0.8$ from our
upper bound.

In conclusion, we may bound the length L of our sequence by

$$\tfrac{1}{2}\log_2 n(\log_3 n + 1) - 0.8 \;<\; L \;<\; \tfrac{1}{2}\log_2(n-1)\,(\log_3(n-1) + 1 + 2\log_3 2) + 1.8$$

where $2\log_3 2$ is about 1.26. The width of this bound is about
$\log_3 n$ (note that $\log_2(n-1) \approx \log_2 n - n^{-1}\log_2 e$). So
$\frac{1}{2}\log_3 2\,(\log_2 n)^2$, or $0.315(\log_2 n)^2$, is a good approximation to L.

It follows immediately that $0.315\,n(\log_2 n)^2$ is a good estimate
of the best-case, worst-case and average number of comparisons required
by Shellsort using this sequence. This is of interest in that, at least
asymptotically, it is the fastest known way to Shellsort, in the sense
that the other Shellsorts we could analyze take time proportional to
$n^{3/2}$ on some arrays. However, it is of more interest for a reason we
will pursue in Chapter 4. Here we shall confine ourselves to some
remarks about this particular Shellsort.
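The bounds on the sequence length L can be compared with an exact count for small n. The following sketch is illustrative; the names `pratt_sequence` and `length_bounds` are hypothetical, and the bounds are transcribed from the inequality stated above, valid here for n of at least 2.

```python
import math


def pratt_sequence(n):
    """All numbers of the form 2^p 3^q that are less than n, in descending order."""
    seq = []
    power_of_two = 1
    while power_of_two < n:
        x = power_of_two
        while x < n:
            seq.append(x)
            x *= 3
        power_of_two *= 2
    return sorted(seq, reverse=True)


def length_bounds(n):
    """Lower and upper bounds on the sequence length, as stated in the text."""
    lower = 0.5 * math.log2(n) * (math.log(n, 3) + 1) - 0.8
    l2 = math.log2(n - 1)
    l3 = math.log(n - 1, 3)
    upper = 0.5 * l2 * (l3 + 1 + 2 * math.log(2, 3)) + 1.8
    return lower, upper
```

For n = 1000 the sequence has 40 elements, which lies between the two bounds (roughly 35.5 and 44.4).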

There are many orders in which the sequence may be generated and yet
have 2p and 3p precede p. One computationally convenient sequence
generates all $2^p 3^q$ for $p+q = \lfloor\log_2(n-1)\rfloor$, then all $2^p 3^q$ for
$p+q = \lfloor\log_2(n-1)\rfloor - 1$, and so on until p+q vanishes. Thus
$2 \cdot 2^p 3^q = 2^{p+1} 3^q$ and $3 \cdot 2^p 3^q = 2^p 3^{q+1}$, both of which are in the set
preceding the one containing $2^p 3^q$. Within a set for which p+q = i,
start with $2^i$, then multiply by 3/2 until the result is odd or $\ge n$.
Expressing this in near-ALGOL, we have

    for i := 2 ↑ ⌊2 ↓ (n-1)⌋, i ÷ 2 while i ≥ 1 do
      for j := i, (3×j) ÷ 2 while j mod 3 = 0 and j < n do
        for k := 1 step 1 until n-j do
          if A[k] > A[k+j] then swap(A[k],A[k+j]);

Here a ↓ b is an abbreviation for $\log_a b$, which is an abbreviation
for ln(b)/ln(a), and ⌊a⌋ is an abbreviation for entier(a). The
reader is asked to believe that j mod 3 = 1 if j was odd before
doing j := (3×j) ÷ 2, otherwise it is 0 (an inelegant jump over a
hurdle of ALGOL 60).
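For readers without an ALGOL 60 system to hand, the near-ALGOL above can be sketched in Python. This is an illustrative translation under stated assumptions, not the author's program: the names `increments` and `shellsort_2_3` are hypothetical, arrays are 0-indexed, and the "j mod 3" device is replaced by a direct test of whether j is odd.

```python
def increments(n):
    """Yield every 2^p 3^q < n, set by set in order of decreasing p+q, so
    that 2h and 3h (when they are less than n) are produced before h."""
    if n < 2:
        return
    i = 1 << ((n - 1).bit_length() - 1)  # 2^floor(log2(n-1))
    while i >= 1:
        j = i
        while j < n:
            yield j
            if j % 2 == 1:  # an odd j ends the set with this value of p+q
                break
            j = 3 * j // 2
        i //= 2


def shellsort_2_3(a):
    """Sort the list a in place with one compare-exchange pass per increment."""
    n = len(a)
    for h in increments(n):
        for k in range(n - h):
            if a[k] > a[k + h]:
                a[k], a[k + h] = a[k + h], a[k]
    return a
```

For n = 8 the increments come out as 4, 6, 2, 3, 1: a different order from the descending sequence 6, 4, 3, 2, 1, but one in which 2h and 3h still precede h.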
An interesting improvement to the algorithm capitalizes on
Lemma 3.4, by avoiding a test following an exchange.

    for i := 2 ↑ ⌊2 ↓ (n-1)⌋, i ÷ 2 while i ≥ 1 do
      for j := i, (3×j) ÷ 2 while j mod 3 = 0 and j < n do
        for k := 1 step 1 until j do
          for l := k step j until n-j do
            if A[l] > A[l+j] then begin swap(A[l],A[l+j]);
                                        l := l+j
                                  end

Whereas before we simply corrected all p-inversions, starting at the
left, we now correct all the inversions of one p-chain before going on to
the next. This change is necessary if we are to take advantage of
Lemma 3.4.

Suppose the body of the inner loop takes 1 unit of time if
$A[l] \le A[l+j]$ and 2 otherwise and that all other operations are negligible.
Then the timing of this version of the algorithm becomes remarkably
independent of how well ordered the initial array was, since for all but
the last two elements of each p-chain, if the body ever takes 2
units, this increase is offset by skipping the next comparison. Thus
each pass will require between n-p and n units of time (depending on
how many p-chains had their last two elements inverted) which is quite a
small range for most p's in the sequence, in comparison with n, for
reasonably large n.
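The timing claim can be observed directly by counting units in a sketch of the improved algorithm. The code below is a hedged illustration: the names are hypothetical, arrays are 0-indexed, every chain (including the last) is processed, and the cost model is the one assumed in the text (1 unit without an exchange, 2 units with one).

```python
def increments(n):
    """Yield every 2^p 3^q < n so that 2h and 3h (when < n) precede h."""
    if n < 2:
        return
    i = 1 << ((n - 1).bit_length() - 1)
    while i >= 1:
        j = i
        while j < n:
            yield j
            if j % 2 == 1:
                break
            j = 3 * j // 2
        i //= 2


def shellsort_skip(a):
    """Sort a in place, one chain at a time, skipping the comparison that
    follows an exchange. Returns a list of (increment, units) per pass."""
    n = len(a)
    costs = []
    for h in increments(n):
        units = 0
        for start in range(h):          # one h-chain at a time
            l = start
            while l + h < n:
                if a[l] > a[l + h]:
                    a[l], a[l + h] = a[l + h], a[l]
                    units += 2
                    l += h              # Lemma 3.4: the next pair is in order
                else:
                    units += 1
                l += h
        costs.append((h, units))
    return costs
```

Each pass with increment h then costs between n-h and n units, whatever the initial order of the array.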

The numbers 2 and 3 are not the only possible choice for this
algorithm. In fact, any set $x_1,\dots,x_m$ will do if their greatest common
divisor is 1, by Lemma 2.2. The corresponding sequence is the set of
numbers less than n that have only $x_1,\dots,x_m$ as factors, in
descending order, say, giving a timing of $O(n \log^m n)$. The Shellsort
of Chapter 1 must be used, since we now need the inner loop again.
Preliminary investigation of various sets seems to indicate that {2,3}
is the best choice, as far as the upper bound is concerned. However,
other sets with two elements may conceivably give a faster running time
on the average.


Chapter 4

A Shell Sorting Network

4.1 Sorting networks

The  most  interesting  feature  of  the  algorithm  of  Chapter  3  is  that 
it  suggests  a  fast  sorting  network.  A  sorting  network  is  a  set  of 
comparators  wired  such  that  when  an  array  of  numbers  is  presented  to 
the  input  terminals  of  the  network,  the  same  numbers  rearranged  in 
ascending order are presented at the output terminals. A comparator is
a two-input two-output sorting network which may be treated as a black
box for the purpose of designing sorting networks with them as the basic
building blocks. The basic difference between a sorting network and a
general-purpose  computer  programmed  to  read,  sort  and  output  arrays  is 
that  whereas  the  control  structure  of  the  computer  is  inherently  serial, 
forcing  comparisons  and  exchanges  to  be  done  one  at  a  time,  the  network's 
control  structure  is  defined  by  the  comparators  alone  and  can  be  made 
highly  parallel;  all  that  is  required  in  the  way  of  control  is  that  a 
comparator  start  work  just  when  it  has  received  both  of  its  inputs. 

By way of example, the illustrated network of Figure 4.1 will sort
four numbers with 3 units of delay.

Figure 4.1. Sorting Network for four elements.

To show that this particular network indeed sorts, note that output 4
must get the maximum of the inputs, and output 1 the minimum. Hence
$A[1] \le A[2]$, and $A[3] \le A[4]$. The last comparator guarantees
$A[2] \le A[3]$, and these three relations then mean that the array is
sorted.
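A four-element network with 3 units of delay can be simulated in a few lines. The wiring used below is an assumption, since the figure itself is not reproduced here: it is the standard five-comparator arrangement, chosen because it matches the argument just given (the final comparator joins lines 2 and 3). Lines are numbered from 0 in the code.

```python
from itertools import permutations

# Assumed wiring: three stages, comparators within a stage act in parallel.
FOUR_ELEMENT_NETWORK = [
    [(0, 1), (2, 3)],
    [(0, 2), (1, 3)],
    [(1, 2)],
]


def run_network(stages, values):
    """Apply each comparator (low, high): the smaller value goes to `low`."""
    v = list(values)
    for stage in stages:
        for low, high in stage:
            if v[low] > v[high]:
                v[low], v[high] = v[high], v[low]
    return v


def sorts_everything(stages, n):
    """True when the network sorts every permutation of n distinct keys."""
    return all(
        run_network(stages, perm) == sorted(perm)
        for perm in permutations(range(n))
    )
```

Since each stage touches every line at most once, the delay is the number of stages, here 3.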

A convenient representation for a comparator and its wiring is as in
Figure 4.2:

Figure 4.2. An abbreviated comparator.

where the vertical line denotes the comparator and the arrow denotes the
direction the larger number goes. Thus our previous example would be
drawn as in Figure 4.3.


Figure 4.3. An abbreviated sorting network for four elements.


Note that the inputs to the final comparator as drawn here are inverted
with respect to the corresponding inputs in the other diagram. This does
not affect its operation; indeed the two inputs to a comparator never need
be distinguished, since max and min are commutative functions.

The important thing about sorting networks is that they permit
considerable parallelism, since up to n/2 comparators can be working
simultaneously on n lines. One might deduce from this that, since there
exist serial algorithms taking time n log n, one could do n/2 of
these steps at a time in a sorting network, taking O(log n) units of
time. This deduction breaks down because it is not necessarily possible
to predict which n/2 comparisons to do at any given time without knowing
in advance the outcome of some of those same comparisons. The
best asymptotic timing to date has been Batcher's algorithm [Batcher 1968],
which takes $\frac{1}{2}\log_2^2 n$ units of delay asymptotically (where a unit is the
time to compare and exchange two elements) and uses $\frac{1}{4} n \log_2^2 n$ comparators
asymptotically. (See also Van Voorhis [1971].)

Further  discussion  of  sorting  networks  can  be  found  in  Floyd  and 
Knuth  [1970]. 

4.2 Shellsort with standard comparators

In the algorithm of Chapter 3, where an array was sorted by being
$2^p 3^q$-sorted for all integers $p,q \ge 0$ with $2^p 3^q < n$, it was noted
that for each p and q, the corresponding $2^p 3^q$-sorting involved only
correcting  a  small  number,  I  ?  J  ,  of  adjacent  inversions.  It  is 
possible to do all of this work in two stages of parallelism, by first
simultaneously comparing every even-numbered location (numbering the
elements of a p-chain 1,2,3,...) with its predecessor, and then doing
the same for odd-numbered elements. For a single p-chain with 6 elements
we would have Figure 4.4.

Figure 4.4. A sorting network for a 6 element p-chain
that is already 2- and 3-ordered.

Since the p-chains are independent for a fixed p, we can
extend this parallelism to the whole array. Thus if the above example
were of a single 3-chain in an array of 16 elements, the whole
3-sorting stage would involve the network of Figure 4.5:

Figure 4.5

where 3a and 3b denote the first and second stages of 3-sorting
respectively.


A complete sorting network for 8 elements would use the characteristic
sequence 6, 4, 3, 2, 1, as in Figure 4.6.

Figure 4.6

From the analysis of Chapter 3, we know that there are about
$\frac{1}{2}(\log_2 n)(\log_3 n)$ elements in the sequence, whence we need at most
$(\log_2 n)(\log_3 n)$ stages of parallelism, each requiring one unit of delay.
So this network takes about $\frac{1}{\log_2 3}\log_2^2 n$ units, slower than Batcher's
algorithm by a factor of 1.26.
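The two-stage construction can be sketched and checked for n = 8. This is an illustration under assumptions: the function names are hypothetical, lines are numbered from 0, and the layout of the original figure is not reproduced; only the rule in the text is followed (stage a compares even-numbered chain elements with their predecessors, stage b does the same for odd-numbered elements).

```python
from itertools import product


def shell_network(n):
    """Return (sequence, stages) for the Shellsort network on n lines.
    Each increment h contributes two stages of disjoint comparators."""
    bound = n.bit_length()
    sequence = sorted(
        {2 ** p * 3 ** q for p in range(bound) for q in range(bound)
         if 2 ** p * 3 ** q < n},
        reverse=True,
    )
    stages = []
    for h in sequence:
        stage_a, stage_b = [], []
        for start in range(h):
            chain = list(range(start, n, h))
            for idx in range(1, len(chain)):
                pair = (chain[idx - 1], chain[idx])
                # chain[idx] is element number idx+1 of the chain (1-based)
                (stage_a if idx % 2 == 1 else stage_b).append(pair)
        stages.extend([stage_a, stage_b])
    return sequence, stages


def run_network(stages, values):
    """Apply each comparator (low, high): the smaller value goes to `low`."""
    v = list(values)
    for stage in stages:
        for low, high in stage:
            if v[low] > v[high]:
                v[low], v[high] = v[high], v[low]
    return v


def sorts_all_zero_one_inputs(stages, n):
    """Check the network on every input consisting of zeroes and ones."""
    return all(
        run_network(stages, bits) == sorted(bits)
        for bits in product((0, 1), repeat=n)
    )
```

For n = 8 this yields the sequence 6, 4, 3, 2, 1 and ten stages, some of which are empty because the longer increments give chains of at most two elements.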

4.3  A  faster  network 

In the above network, for each p in the characteristic sequence,
we had to partition into two groups those comparators responsible for
p-sorting, in order to avoid having two comparators working on the same
line simultaneously. But Lemma 3.4 tells us that two such comparators may
not both swap their inputs. This suggests a conjecture, that the comparators
can operate in parallel anyway, perhaps with some modification to the design
of the comparators. Were this possible with no increase in delay of a
comparator, our network would then function with $\frac{1}{2}\log_2 n \log_3 n$ units
of delay asymptotically, which would represent an improvement of a factor
of $\log_2 3 = 1.585$ over the networks described by Batcher [1968].

Unfortunately, there is no universally applicable way of doing this.
For if there were, we could apply it to the particular case where the
quantities to be sorted may take on only the values 0 and 1, and the
only available components are two-input AND and OR gates, each with a
delay of 1 unit. Now we may readily build a comparator for this domain
as in Figures 4.7 and 4.8.

Figure 4.7. A comparator for zeroes and ones.

Figure 4.8. Notation.


Since our conjecture refers only to that part of a sorting network
corresponding to a single element of the $2^p 3^q$ sequence, we need only
consider, in isolation, the realization of this part as AND and OR gates.
Moreover, within that part, we only need consider a single $2^p 3^q$-chain
as in Figure 4.9, since chains are not connected to each other within
that part.

Figure 4.9. (a) A sorting network for one p-chain (as in Figure 4.4,
                minus one input).
            (b) Implementation of Figure 4.9(a) using Figure 4.8.


The delay of this stage of the sorting network as implemented in
Figure 4.9(b) is just 2 units. In order to improve on this we must reduce
the delay to 1 unit. But then output 3 of Figure 4.9(a) would have to be
the output of some gate with two inputs chosen from the inputs B, C, D
and E (the only inputs that could affect output 3), since only 2-input
gates are available. But output 3 is a non-trivial function of B, C
and D, even under the condition that the input is p-ordered for all
p > 1. (Consider BCD = 001 vs. 101 (for B), 001 vs. 011 (for C),
and 010 vs. 011 (for D).) Hence it is not possible to reduce the delay
of this stage to 1 unit. (The reader may readily extend this result to
the case when multiple-input AND and OR (and even NAND and NOR) gates
are available.)

So the best we can hope for is that under some conditions we may
be able to take advantage of Lemma 3.4. To show that our conjecture is
not completely without grounds we shall give in detail an example of a
situation much closer to real-life problems and state-of-the-art
technology than the foregoing somewhat artificial one, and show that in
this situation we may come closer to realizing the desired factor-of-two
speed-up.

A more common domain for sorting purposes is that of the integers.
A common representation for this domain is the familiar binary notation.
Let us assume that we want to sort n words of w bits each, and that
each word represents an unsigned binary number.

One way to implement a sorting network for this situation is to have
each comparator process all w bits of each of its two inputs in
parallel before outputting anything. This is fast but expensive, since
each bit requires some hardware of its own. At first sight, serial
processing (most significant bit first) would seem to involve a speed
decrease of a factor of about w. However, it should be clear that
as soon as a comparator has inspected a bit from each of its inputs, it
may pass those bits on to the next comparator before it has seen any more
of its input. (A problem that can arise here is that a comparator may
not know at some time which of its inputs is the maximum. But in this
case, all pairs of bits seen so far must have been equal, so that it
doesn't matter which way it routes its output.) So the time required for
a serial network is really just that required by a parallel network,
plus (rather than times) the time required to pass w bits (to be precise,
w-1) through a comparator. Hence serial organization would appear to
be economically sound here, and we shall assume it for our example.
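The serial scheme just described can be sketched in software. The following is only an illustrative model, not the circuit of the thesis: the function names and the list-of-bits representation are assumptions made for the example. It shows that each output bit depends only on the current pair of input bits and on what the earlier bits revealed, so a comparator never has to wait for the rest of a word.

```python
def serial_compare_exchange(a_bits, b_bits):
    """Bit-serial compare-exchange, most significant bit first.

    Returns (min_bits, max_bits). Each output bit is produced from the
    current input bits and the relation learned from earlier bits only.
    """
    relation = "="  # what the earlier bits say about A versus B
    min_bits, max_bits = [], []
    for a, b in zip(a_bits, b_bits):
        if relation == "=":
            # Prefixes agree so far, so the smaller word carries the smaller bit.
            min_bits.append(a & b)
            max_bits.append(a | b)
            if a != b:
                relation = "<" if a < b else ">"
        elif relation == "<":
            min_bits.append(a)
            max_bits.append(b)
        else:
            min_bits.append(b)
            max_bits.append(a)
    return min_bits, max_bits


def to_bits(value, width):
    """Unsigned binary representation, most significant bit first."""
    return [(value >> shift) & 1 for shift in range(width - 1, -1, -1)]


def from_bits(bits):
    """Inverse of to_bits."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value
```

While the relation is still "=", the two inputs have carried identical bits so far, which is why the routing of those earlier bits did not matter.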

Finally,  let  us  assume  that  we  have  available  NAND  gates  and  NOR 
gates with any number of inputs, and flipflops. While this restricted
repertoire  does  an  injustice  to  the  present  state  of  the  art  of  integrated 
circuits  (where  the  effect  of  parasitic  lead  capacitances  on  timing 
dominates  that  of  AND-OR-INVERT  gate  propagation  delays  in  some  cases), 
it  will  at  least  take  advantage  of  the  reader’s  presumed  familiarity  with 
the  elements  of  traditional  (and  rather  idealized)  switching  theory. 

With  our  assumptions  formulated,  we  shall  exhibit  an  implementation 
of  Figure  4.9(a)  under  these  assumptions. 

Since the input to the network of Figure 4.9(a) is p-ordered for
all p > 1, and since the output is completely ordered, it follows that
each output is just the median of its three closest inputs; e.g.
output 2 is the median of A, B and C, output 3 is the median of
B, C and D, and so on. Hence it suffices to implement Figure 4.9(a)
as in Figure 4.10(a), where the output of each box labelled M' is the
median of its inputs. (The 0 and 1 inputs at the top and bottom are
fixed at those values throughout the operation, and thus correspond in
their effect to values of $-\infty$ and $+\infty$ respectively.)
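The median property can be checked exhaustively on small inputs. The sketch below is an illustration under assumed names (`is_p_ordered`, `median_stage`, `count_verified` are not from the thesis); it uses infinite sentinels in place of the fixed 0 and 1 inputs and confirms that one layer of median-finders sorts every permutation that is p-ordered for all p > 1.

```python
from itertools import permutations


def is_p_ordered(xs, p):
    """True if every element is <= the element p places after it."""
    return all(xs[i] <= xs[i + p] for i in range(len(xs) - p))


def median3(a, b, c):
    return sorted((a, b, c))[1]


def median_stage(xs):
    """One layer of median-finders with -inf and +inf sentinels at the ends."""
    padded = [float("-inf")] + list(xs) + [float("inf")]
    return [median3(padded[i], padded[i + 1], padded[i + 2])
            for i in range(len(xs))]


def count_verified(n):
    """Check every permutation of range(n) that is p-ordered for all p > 1.

    Returns how many such permutations were checked; raises AssertionError
    if the median stage fails to sort one of them.
    """
    verified = 0
    for perm in permutations(range(n)):
        if all(is_p_ordered(perm, p) for p in range(2, n)):
            assert median_stage(perm) == sorted(perm)
            verified += 1
    return verified
```

Calling `count_verified(6)` walks all 720 permutations of six items and tests only those satisfying the ordering hypothesis.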


Figure 4.10. (a) Implementation of Figure 4.9(a) using median-finders.
(b) Same as (a) (some detail omitted) with registers R added.

The device M' must keep track of the relations between its inputs.
Because of our serial organization, these relations are a function of
time, in that they depend on how much of the incoming three words has
been seen. It suffices to remember (say for the M' connected to output 2)
whether A < B, A = B or A > B, and whether B < C, B = C or B > C.
Thus it would seem that each M' must be able to distinguish nine cases
(three of which cannot arise, as we shall see later, leaving six cases).

This is wasteful since two adjacent M' s could share the information
about their two common inputs. This immediately suggests Figure 4.10(b)
as a more economical organization. The connections of Figure 4.10(a) are
preserved (although their detail is omitted for clarity in Figure 4.10(b)).
Box M' is now replaced by box M, which no longer has the responsibility
of remembering what happens from one bit to the next. Instead, this
responsibility is delegated to the R boxes. Box M now consults its
three inputs and two R boxes as each new set of 3 bits, A, B, C,
arrives. The output from an R box is 3-valued, viz. <, = or >,
corresponding to whether R thinks that A < B, A = B or A > B.

There are two approaches to the timing of M. Either M may wait
until the R boxes have decoded their inputs before deciding which of
A, B and C is now the median, or M may decide to go ahead with the
new A, B and C bits but using the old states of the boxes R,
corresponding to the situation up to the previous A, B and C bits.
That is, M may anticipate the next states of the R boxes, without
waiting on them. The merit of this approach is that the delay of the
whole stage is then reduced by an amount possibly as great as the delay
of R. We shall adopt this second strategy.

We  may  build  R  boxes  as  in  Figure  4.11. 


Figure 4.11. Implementation of an R box, with 2 gates, 2 flipflops.
(Circuit diagram and its "Notation" inset for the flipflop symbol not
reproduced.)


Note that the complemented inputs $\bar{A}$ and $\bar{B}$ may be derived
from A, B respectively using inverters (single input NAND or NOR gates).
However, it is highly likely that a practical design will interpose a
flipflop between the output of a median finder and the next stage in order
to control the movement of bits through the network, both to avoid one bit
catching up with its predecessor and to ensure that the three inputs all
arrive simultaneously at a median finder. Inherent in the design of
flipflops made of NOR gates (the usual strategy) is the accessibility
of both the flipflop's output and its complement. Thus the inverters
would then not be needed.

The two AND gates may independently be replaced by NOR gates, with
the appropriate changes to their inputs (that is, use the complemented
value of each input instead). This follows from De Morgan's Law that
$AB = \overline{\bar{A} + \bar{B}}$. We used AND gates to aid the reader's
understanding of the circuit's operation.

The notation for flipflops should be self-explanatory. It was
suggested to us by H. Stone and sidesteps the irrelevant issue of whether
to use the Q or the $\bar{Q}$ output (a designation quite arbitrarily
assigned by each flipflop's manufacturer, with R (reset) and S (set)
inputs then named to correspond to this arbitrary choice) to denote
some variable. Each half of the flipflop represents a buffer that
"remembers" any logic level of 1 that arrives at its input. The
juxtaposition of the two halves represents the fact that the buffer is
made to "forget" (i.e., return to state 0) if a level of 1 arrives
at the input of the other half.

In the operation of the R box, input G (go) is momentarily set
to 1 prior to starting to sort, and then returns to 0 for the
remainder of the sorting operation. Thus every R box initially
supposes that both $A \not< B$ and $A \not> B$, that is, A = B.

By  inspection  of  the  circuit,  as  long  as  A  =  B  at  the  inputs,  the 
state  of  R  will  be  undisturbed.  Suppose  A  =  0  and  B  =  1  at  some 
time.  If  A  =  B  before  this,  then  we  must  have  A  <  B  ,  and  R  digests 
this  fact  by  complementing  the  upper  flipflop.  The  dual  situation 
obtains  if  A  =  1  and  B  =  0  . 

Once one of the flipflops has been complemented, it is clear that
no further change of state of R is possible. The complemented flipflop
will never see another 1 at its G input, and the other flipflop's
input has been turned off by the complementing. So only three states
of R are possible, corresponding to A = B, A < B and A > B, as
desired. These states are communicated to the outside world by four
outputs (abbreviated to one in Figure 4.10(b)) labelled $<$, $>$,
$\overline{<}$ and $\overline{>}$ respectively.
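At the behavioural level, an R box is a three-state register that latches on the first pair of differing bits. The sketch below models only that behaviour; the class name `RBox` and its `clock` method are assumptions for illustration, and the two flipflops and two gates of Figure 4.11 are abstracted into a single state variable.

```python
class RBox:
    """Three-state register recording how two bit-serial words compare."""

    def __init__(self):
        # Models the initial pulse on G: assume A = B until shown otherwise.
        self.state = "="

    def clock(self, a, b):
        """Absorb one pair of bits (most significant first); return the state."""
        if self.state == "=" and a != b:
            # The first differing pair of bits settles the comparison.
            self.state = "<" if a < b else ">"
        # Once the state has left "=", later bits cannot change it.
        return self.state
```

For example, feeding the pairs (1, 1), (0, 1), (1, 0) leaves the register in state "<" after the second pair, and the third pair has no effect.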

Let us now proceed to a circuit for the M boxes, given by Figure
4.12. We use AND and OR gates for pedagogical reasons; an equivalent
circuit may be obtained by replacing every gate with a NAND gate; recall
De Morgan's Law that $A + B = \overline{\bar{A}\,\bar{B}}$.


Figure 4.12. An implementation of a median finder M. (Circuit diagram
with inputs A, B and C not reproduced.)

The  unprimed  inequalities  labeling  the  input  terminals  denote  the 
appropriate  outputs  of  the  R  box  that  is  to  be  connected  between  A 
and  B  .  Call  this  R  box  simply  R  .  The  primed  inequalities  are 
for  the  R  box  between  B  and  C  .  Call  this  box  R'  . 

To verify that this circuit works, it suffices to enumerate the
possible pairs of states of R and R'. The details are encapsulated
in Figure 4.13.

Three cases, (>,>'), (>,=') and (=,>'), are not shown
because they cannot occur. Each of these cases implies that A > C,
contradicting Lemma 3.4. This remark is independent of what bits are
seen later, even if eventually (>,=') becomes (>,<'), say. The
explanation in this case is that if there is so far no way to distinguish
B from C, yet R can tell that A > B for some reason, then we
must deduce that A > C for the same reason.

The figure shows those inputs that are set to 1 for each of the 6
possible cases. By eliminating those AND gates of Figure 4.12 that have
a 0 input, and then simplifying the remaining circuit, it is easy to
arrive at the equivalent circuits shown for each case.

To verify that the equivalent circuits are the desired ones, note
that we have essentially reduced the problem to the case when the data to
be sorted can have only the two values 0 and 1. The median finder's
responsibility is simply to decide which of three bits is the output.
It is the responsibility of R and R' to decide which equivalent circuit
is required for any particular set of 3 bits.

For the case (=,='), we clearly want the full-blown circuit of
Figure 4.9(b). Inspecting output 2 of that circuit shows that we have
the correct equivalent circuit.

For the case (=,<'), C cannot be the median, so we want
max(A,B). Figure 4.8 verifies this equivalent circuit.

In the case (<,='), A cannot be the median, so we want
min(B,C). Again Figure 4.8 confirms the circuit.

The remaining cases correspond to B < A < C, A < B < C and
A < C < B respectively, giving medians A, B and C respectively. So
the circuit of Figure 4.12 does indeed work.

Note that if R is in state <, the output of M is independent
of input A, and similarly for input C when R' is in state <', as
can be seen from Figure 4.13.

It  follows  that  in  Figure  4.10(b)  the  top  input  of  the  top  M  box 
need  not  be  set  to  0  as  in  Figure  4.10(a),  and  similarly  for  the  very 
bottom  input.  Thus  these  two  inputs  may  be  tied  to  any  convenient 
terminal  in  practice,  provided  the  terminal's  voltage  does  not  interfere 
with  the  otherwise  correct  functioning  of  the  gates  thereby  attached. 
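The six-case analysis of the median finder can be mirrored in a short behavioural model. This is a sketch under stated assumptions, not a transcription of Figure 4.12: the function names are hypothetical, the case (=,=') is modelled by a three-input majority function rather than by the gate network of the thesis, and the inputs are assumed to satisfy A <= C as Lemma 3.4 guarantees. As in the strategy adopted above, each output bit is formed from the new bits and the old register states.

```python
def next_state(state, x, y):
    """Behaviour of an R box: latch on the first differing pair of bits."""
    if state == "=" and x != y:
        return "<" if x < y else ">"
    return state


def median_bit(r, r_prime, a, b, c):
    """Output bit of M, given the old states of R (A vs B) and R' (B vs C)."""
    case = (r, r_prime)
    if case == ("=", "="):
        return (a & b) | (b & c) | (a & c)  # majority of the three bits
    if case == ("=", "<"):
        return a | b                        # max(A, B)
    if case == ("<", "="):
        return b & c                        # min(B, C)
    if case == (">", "<"):
        return a                            # B < A < C
    if case == ("<", "<"):
        return b                            # A < B < C
    if case == ("<", ">"):
        return c                            # A < C < B
    raise ValueError("state pair implies A > C, which cannot occur")


def serial_median(a_bits, b_bits, c_bits):
    """Median of three words fed most significant bit first, assuming A <= C."""
    r, r_prime = "=", "="
    out = []
    for a, b, c in zip(a_bits, b_bits, c_bits):
        out.append(median_bit(r, r_prime, a, b, c))  # uses the old states
        r, r_prime = next_state(r, a, b), next_state(r_prime, b, c)
    return out
```

In this model the output does not depend on `a` whenever `r` is "<", nor on `c` whenever `r_prime` is "<", which is the property that lets the outermost inputs be tied to any convenient terminal.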

The  crucial  question  now  is  that  of  speed.  In  particular,  how  does 
this  circuit  compare  with  the  fastest  possible  circuit  for  a  standard 
comparator  for  use  in  Batcher's  network?  Any  answer  to  this  will  almost 
certainly  have  to  depend  on  a  detailed  knowledge  of  the  relative  speeds 
of  the  available  devices  for  building  comparators. 

Figure 4.14 exhibits a possible implementation of a comparator.
The principle of operation of the structure in Figure 4.14(a) is the
same as that of Figure 4.10(b). The only difference is that in place
of the three-argument median finders, we now have max and min finders,
each with only two arguments. The circuits for MIN and MAX are
analogous to that of Figure 4.12, and the style of argument represented
by Figure 4.13 carries over to these circuits quite trivially. As
before, NAND gates may be used throughout. It is interesting to note
that although the circuit for M was developed independently of those
for MIN and MAX, the MAX circuit is obtainable directly from the M
circuit by removing the bottom three AND gates of M and the <' input
of the second AND gate (that is, everything to do with input C). The
MIN circuit is almost as easily obtained (together with some simplification)
by suppressing anything to do with input A.


Figure 4.14. (a) Structure of a comparator.
(b), (c) MIN, MAX circuits.


We conjecture that the circuit of Figure 4.14 is very close to the
fastest possible for a standard comparator, using the existing technology
based on NAND and NOR gates. In support of this, we can prove that the
two-gate delay of this circuit cannot be reduced to a one-gate delay.
For if it could, each gate (necessarily one for each output) would have
to be a NAND or NOR gate. But neither these nor AND nor OR gates are
suitable. Consider the MIN output. This cannot be the AND or NAND of
the inputs A and B, since there are occasions when one of A or B
is 0 yet a 1 output is required (e.g. when R knows A < B, the
current A bit is 1 and the B bit is 0) or a 0 output is required
(e.g. both A and B are simultaneously 0 at some time). A dual
argument says that the MIN output cannot be the OR or NOR of the inputs
A and B. A fortiori, the MIN output cannot be a single-gate function
of A, B and the state of R.
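The first part of this argument, that no single two-input gate on A and B yields the MIN output, is small enough to check by enumeration. The sketch below assumes a behavioural description of the MIN output (the A bit when R is in state <, the B bit when R is in state >, and the AND of the two bits while R still reports equality); the names are illustrative. It does not attempt the stronger claim about gates that also see the state of R.

```python
from itertools import product

# The four candidate single gates acting on the current A and B bits.
GATES = {
    "AND": lambda a, b: a & b,
    "NAND": lambda a, b: 1 - (a & b),
    "OR": lambda a, b: a | b,
    "NOR": lambda a, b: 1 - (a | b),
}


def min_bit(state, a, b):
    """Required MIN output bit, given the state of R before these bits."""
    if state == "<":
        return a
    if state == ">":
        return b
    return a & b  # prefixes equal so far


def gates_matching_min():
    """Names of gates that agree with min_bit for every state and bit pair."""
    cases = list(product("<=>", (0, 1), (0, 1)))
    return [name for name, gate in GATES.items()
            if all(gate(a, b) == min_bit(s, a, b) for s, a, b in cases)]
```

The function returns an empty list: AND already fails in the situation cited in the text (R in state <, A bit 1, B bit 0), and the other three gates fail when both bits are 0 or when the bits differ while R reports equality.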

Hence the question of optimality of the circuits of Figure 4.14
involves mostly very technology-dependent issues such as the effect of
fan-in and fan-out on gate propagation delays, the ratio of turn-on
to turn-off delays (quite significant with bipolar transistor TTL
technology) and whether it is possible to wire-OR gate outputs (as with
tri-state logic for example; this gives the effect of having OR gates
with no delay). Each gate in Figure 4.14 has a fan-in of at most 3,
and a fan-out of at most 4. It would seem unlikely that this could be
significantly improved, especially in view of the fact that the delay of
currently available gates as quoted by their manufacturers is independent
of the fan-in for up to about six inputs, and increases by about 5 percent
(for fast gates) for each extra device loading the output, up to a fan-out
of about 10.

If  wired-OR  is  possible,  this  gives  all  our  circuits  (except  for  R  ) 
the  effect  of  one  gate  of  delay,  so  the  issue  of  the  availability  of 
wired-OR  logic  would  not  appear  to  significantly  damage  our  conjecture. 

The  issue  of  turn-on/turn-off  delays  is  probably  too  transistor-dependent 
to  be  worth  discussion  here.  The  reader  is  challenged  (if  he  is 
interested  in  technology-dependent  arguments)  to  try  to  show  constructively 
that  the  ratio  of  turn-on  to  turn-off  delays  (within  reasonable  limits) 
affects our conjecture.

Of course, none of this is very relevant if the delay of R
exceeds that of the other devices. In this case, our median-finder
is as fast as our comparator (ignoring the fan-out of R for the moment),
and our comparator in turn is probably close to optimal, in view of the
triviality of the circuit for R. Taking the fan-out of R into
consideration, this is at most 2 for each output from R in Figure 4.14
(counting the connections within R), and at most 3 in Figure 4.12
(provided we are using NOR gates in R; with AND gates as shown, the
fan-out of the < and > outputs becomes 4). So a delay of at most
5 percent that of a gate (we can build flipflops from NOR gates), and
hence less than 3 percent of the whole circuit, is about the main
difference in timing between these circuits.

In the event that R turns out to be faster than our median finder,
we need to show that the latter is not much slower than our MAX and MIN
circuits. The only significant difference is that the fan-out of the
output of M is 5 more than that for our comparator outputs (6 if
we don't have flipflops at the output, for then R will require two
inverters). To get around this disparity, we can "move the fan-out back
a gate", by duplicating or triplicating the circuitry for the OR gate in
each of our circuits, at the cost of increasing the fan-out of the AND
gates. The optimum appears to be triplication for M and duplication
for each of MIN and MAX, independently of whether we use flipflops
between stages. (The flipflops if they are present must be duplicated
along with the OR gates.) Without flipflops, the optimized "accumulated
fan-out" (maximum fan-out of any AND gate plus maximum fan-out of any OR
gate) is 7 for M (3 for the ANDs, 4 for the ORs) and 5 for
MIN/MAX (2 for ANDs, 3 for ORs). With flipflops, it is 6 for M
(3 for the ANDs, 3 for the ORs) and 4 for MIN/MAX (2 for ANDs,
2 for ORs). The situation for M with flipflops is shown in Figure 4.15.
In both cases the difference between M and MIN/MAX is 2, corresponding
to a difference in delay of about 10 percent of a gate, or at most
5 percent of a comparator without flipflops, even less for one with
flipflops.
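The arithmetic behind these percentages can be laid out explicitly. The sketch below uses only figures quoted in the text (about 5 percent of a gate delay per extra load, and the accumulated fan-outs of 7 versus 5 and 6 versus 4); the names are illustrative, and treating a comparator without flipflops as two gate delays is an assumption taken from the two-gate structure discussed earlier.

```python
DELAY_PER_EXTRA_LOAD = 0.05  # fraction of one gate delay per extra load (from the text)
COMPARATOR_GATE_DELAYS = 2   # assumed: AND level followed by OR level


def accumulated_fan_out(and_fan_out, or_fan_out):
    """Maximum AND fan-out plus maximum OR fan-out, as defined in the text."""
    return and_fan_out + or_fan_out


def extra_delay_in_gates(m_fan_out, comparator_fan_out):
    """Extra delay of M over MIN/MAX, in units of one gate delay."""
    return (m_fan_out - comparator_fan_out) * DELAY_PER_EXTRA_LOAD


# Without flipflops: M is 3 + 4, MIN/MAX is 2 + 3.
without_ff = extra_delay_in_gates(accumulated_fan_out(3, 4),
                                  accumulated_fan_out(2, 3))
# With flipflops: M is 3 + 3, MIN/MAX is 2 + 2.
with_ff = extra_delay_in_gates(accumulated_fan_out(3, 3),
                               accumulated_fan_out(2, 2))
# As a fraction of a two-gate comparator.
fraction_of_comparator = without_ff / COMPARATOR_GATE_DELAYS
```

Both differences come to about a tenth of a gate delay, and dividing by the two gate delays of the comparator gives the figure of about 5 percent.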


Figure 4.15. M with triplicated OR gates and flipflop buffers.
(Not all AND gates shown.) Annotation in the figure: 3 + 3 + 3 = 9 =
total load of next stage.


In  conclusion,  there  seems  little  reason  to  doubt  that  with 
state-of-the-art  technology,  we  can  build  median-finders  whose  speed  is 
within  5  percent  of  the  speed  of  the  best  comparators.  Thus,  using  our 
$2^p 3^q$ network, we may improve on Batcher's network by a factor of between
1.5 and 1.585.




Chapter  5 
Epilogue 

5.1.  Summary  and  Suggested  Problems. 

For each chapter, we shall summarize its results and suggest
problems associated with that chapter.

In Chapter 2, we gave an upper bound (namely $O(n^{3/2})$) on the
worst-case time for Shellsorts that use "fuzzy" geometric progressions with
short coprime subsequences throughout. In addition we showed that when
these progressions had an integer common ratio, the upper bound could
not be improved other than to within a constant factor. This leaves
open the following problems.

1. What is the constant factor (as a function of the given
characteristic sequence) for the worst case of Shellsort with the
integer-common-ratio sequences?

2. Can the $O(n^{3/2})$ bound be improved if the ratio is not an
integer, but, say, $\sqrt{2}$?

3. What other properties do Shellsorts with geometric sequences
have? For example, what is the mean and the variance of the time for
Shellsort with Hibbard's sequence, given some frequency distribution
for the data?

In Chapter 3, we showed that $O(n^{3/2})$ is certainly not the ultimate
fate of Shellsort. We did this by exhibiting one sequence for which
Shellsort takes time $O(n \log^2 n)$. Some problems this raises are:

4. What is the ultimate speed of which Shellsort is capable?
(Is $O(n \log^2 n)$ the best possible?)

5. What is the average time for Shellsort using sequences of the
form $2^p 5^q$, etc.? Is it better or worse than that for the $2^p 3^q$
sequence?

In Chapter 4, we converted the serial algorithm of Chapter 3
into a highly parallel one. Our arguments were, unfortunately, based
on the state-of-the-art of the electronics industry. We showed that
there was no universal way to eliminate this dependency, by describing
a rather trivial environment where our method failed to compete with
Batcher's method. This raises these questions.

6. What environments less trivial than the domain {0,1} also
handicap our method?

7. Are there environments for which our method is still better
than Batcher's, but only by, say, a factor of 1.2?

8. What is the advantage of our method when we can afford to
build parallel comparators? (This costs many times more, with a
disproportionately small return on the investment, making this question
of interest mainly to the very rich.)

9. Is it a coincidence that all attempts to build faster sorting
networks have resulted in networks that take time $O(\log^2 n)$, or is this
the asymptotic lower bound?

5.2. Conclusions and Perspective.

The  unifying  basis  for  this  thesis  is  the  sorting  technique 
described  by  Shell  [1959],  generalized  of  course  to  consider  a  larger 
class  of  characteristic  sequences  for  Shellsort  than  the  one  considered 
in  Shell's  original  paper.  The  sequences  we  considered  in  detail  could 
be  classified  respectively  as  first-  and  second-order  geometric  progressions, 
where  an  m-th  order  progression  has  m  distinct  ways  of  generating  new 
elements  of  the  progression  from  old  ones  (e.g.  multiplying  by  either  2 
or  3,  as  in  the  second-order  geometric  progression  of  Chapter  3). 
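For concreteness, a second-order sequence of this kind can be generated and used as follows. This is a minimal serial sketch, assuming the increments are all integers of the form $2^p 3^q$ less than the input length, taken in descending order; the function names are illustrative and the inner loop is an ordinary insertion sort with stride h, not the sorting network of Chapter 4.

```python
def increments_2p3q(n):
    """Descending list of the integers 2**p * 3**q that are less than n."""
    increments = []
    power_of_three = 1
    while power_of_three < n:
        value = power_of_three
        while value < n:
            increments.append(value)
            value *= 2          # multiply by 2: one way of generating elements
        power_of_three *= 3     # multiply by 3: the other way
    return sorted(increments, reverse=True)


def shellsort(items):
    """Return a sorted copy of items using the increments above."""
    data = list(items)
    for h in increments_2p3q(len(data)):
        # h-sort: insertion sort on each subsequence with stride h.
        for i in range(h, len(data)):
            current = data[i]
            j = i
            while j >= h and data[j - h] > current:
                data[j] = data[j - h]
                j -= h
            data[j] = current
    return data
```

For an input of length 20 the increments are 18, 16, 12, 9, 8, 6, 4, 3, 2, 1.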

The behavior of Shellsort is strikingly different for first-order
geometric progressions as opposed to higher order ones. In the former
case, as remarked in Section 2.1, Shellsort takes time $O(n^{3/2})$ using
perturbed progressions, but time $O(n^2)$ using an unperturbed sequence
of, say, powers of two. The theorems and remarks of Chapter 3 depend
for their proof on the higher-order sequences remaining unperturbed.

The questions answered in Chapters 2 and 3 are of academic interest
only, since there already exist sorting techniques which, on theoretical
grounds alone, are as good as Shellsort, and which on empirical evidence
are much better for almost all applications. Chapter 4 gives a most
interesting exception to this rule, in that we show that in practice
Shellsort is the best method to use for sorting networks, at least
from the point of view of speed. This is not to say that Shellsort
will always be better than Batcher's method; a way of building
considerably faster comparators, which does not apply to our median-finders,
could upset this claim. But the arguments presented in Chapter 4 seem to
indicate that a different technology would be required for this to happen.